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Abstract 

Genome-wide association studies generate very large datasets that require scalable 
analysis algorithms. In this report we describe the GEDI software package, which 
implements efficient algorithms for performing several common tasks in the analysis 
of population genotype data, including genotype error detection and correction, im- 
putation of both randomly missing and untyped genotypes, and genotype phasing. 
Experimental results show that GEDI achieves high accuracy with a runtime scaling 
linearly with the number of markers and samples. The open source C+- 1- code of 
GEDI, released under the GNU General Public License, is available for download at 
http:/ /dna.engr. uconn.edu/software/GEDI/ , 

1 Introduction 

Genome-wide association studies (GWAS) have recently led to the discovery of hundreds 
of genes associated with complex human diseases [£>]. Each such study involves genotyping 
thousands of cases and controls at hundreds of thousands of single nucleotide polymorphism 
(SNP) loci, generating very large datasets that require scalable analysis algorithms. 

In this report we decribe GEDI, a software package implementing efficient algorithms for 
several common tasks in the analysis of GWAS data: 

• Genotype error detection and correction. Despite continuous progress in tech- 
nology and calling algorithms, errors remain present in high-throughput SNP genotype 
data [IB] at levels that can invalidate statistical test for disease association, particularly 
for haplotype-based methods |10j . GEDI implements the likelihood ratio approach to 
error detection from [9]. 

• Missing data recovery. High-throughput genotyping platforms leave uncalled large 
numbers of SNP genotypes. To complement quality control procedures that exclude 
SNP loci and samples with high proportions of missing genotypes, GEDI provides 
methods for maximum-likelihood inference of remaining missing genotypes. 
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• Imputation of genotypes at untyped SNP loci. Current genotyping platforms 
allow simultaneous typing of as many as a million SNP loci, but this is still just a 
fraction of the polymorphisms present in the human population. Imputation of geno- 
types at untyped SNP loci based on linkage disequilibrium information extracted from 
reference panels such as Hapmap [8] is often performed to increase statistical power 
of GWAS studies, see, e.g., [13]. Furthermore, imputation is critical for performing 
meta-analysis of datasets generated using different platforms [TT| . 

• Genotype phasing. Haplotype based association tests can improve statistical power 
compared to single-SNP approaches, but have seen limited use in the analysis of GWAS 
data, in part due to the lack of haplotype inference methods that are both accurate 
and scalable. In an attempt to fill in this gap, GEDI includes an implementation 
of the highly-scalable phasing algorithm of [5], based on entropy minimization. This 
algorithm has been recently used by [1J in conjunction with a haplotyping sharing 
approach to implicate in Parkinson's disease a novel gene missed by traditional single- 
SNP analyses. 

2 Statistical Model 

At the core of GEDI's algorithms is a factorial hidden Markov model (HMM) of multilocus 
SNP genotype data (Fig. [1]). Under this model, a multilocus genotype is formed by random 
pairing of haplotypes obtained as the result of historical recombination among K founder 
haplotypes, where K is a user-selected parameter. More specifically, the founder haplotypes 
of origin for SNP alleles on each autosomal chromosome are assumed to form a first order 
HMM whose parameters are estimated using the Baum- Welch algorithm based on reference 
haplotypes (for imputation of untyped SNP genotypes) or a pool consisting of reference 
haplotypes and haplotypes inferred from the genotype data itself (for error detection and 
missing data recovery). 

Formally, we denote by and 1 the major and minor alleles at every biallelic SNP locus, 
by 0, 1, and 2 the three possible SNP genotypes (homozygous major/minor, respectively 
heterozygous), and by '?' a missing SNP genotype. Multilocus SNP genotypes are modeled 




© © © - © 
Figure 1: GEDI's factorial HMM model for multilocus SNP genotype data. 
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statistically using the factorial HMM represented graphically in Fig. HJ Formally, for every 
SNP locus i £ {1, . . . , n}, we let H iy H- £ {0, 1} be random variables representing the alleles 
observed at this locus on the maternal (paternal) chromosome of the individual under study, 
and Fi, F( £ {1, . . . , K} be random variables denoting the founder haplotype from which Hi 
and H[ originate, respectively. Values taken by these random variables are denoted by the 
corresponding lowercase letters (e.g., g^ hi, fi, etc.). We set P M {g,\hi, h^) to 1 if gi — hi + h[ 
and otherwise. Probabilities Pm(/i), PM(fi+i\fi), and (^-il/i) are estimated using the 
classical Baum- Welch algorithm based on haplotypes inferred from a panel representing 
individual's population of origin, while ?m(/i), PM(fi+i\fi)i an d Phiih'^fj) are obtained by 
simply duplicating these estimates. We emphasize that GEDI estimates all HMM parameters 
based on the available data, without making any assumptions regarding the population under 
study. This underscores the robustness of its statistical model and its applicability to any 
studied population. 

We denote by g[gi <— x] the multilocus genotype obtained from g = (g x , . . . , g n ) by setting 
the value of the (possibly missing) 2-th SNP genotype to x. For a trained factorial HMM 
M, GEDI uses a forward-backward algorithm (see next section for details) to efficiently 
compute all probabilities PM{s[di ^~ x ])- GEDI performs error detection by computing 
the likelihood ratio max x P M (g[gi <— a;])/P M (g) for each non-missing SNP genotype g iy and 
flagging a genotype as a potential error whenever this ratio exceeds a user specified threshold. 
A missing SNP genotype at a typed SNP locus i is replaced by argmax x PM(g[5'i x]). 
Imputation of genotypes at an untyped SNP locus is performed similarly, except that in this 
case genotype probabilities are computed based on a "local" HMM model that spans the 
untyped locus and a user-specified number of typed SNP loci flanking it on each side. 



3 Algorithmic and Implementation Details 

GEDI computes multilocus genotype probabilities under a trained factorial HMM M us- 
ing an efficient forward-backward algorithm. Let g = (g\, . . . , g n ) £ {0, 1, 2, ?} n be a given 
multilocus genotype. For every i = 1, . . . , n, we define the forward and backward probabili- 
ties as Pf.j, = P M (gi, • • • , 9i-i, fi, fi) and B l ftJ , = P M (gi+h ■■■,9n\fi, fl), respectively. Then 

P M (s[gi ^ 1 x])=Ef l=1 Ef> =1 F) lJ >B%j,£^^^ 

Thus all probabilities Pm(s[9i *~ x ])> * — 1, - - - , n, x — 0,1,2, can be computed in 0{nK 2 ) 
once the forward and backward probabilities J 7 ^. ji and B l j. ji are available. 
The forward probabilities can be computed in 0(nK A ) using the recurrence: 

^hJi = PmU^PmU'x) (1) 

= E PM(fi\fi-x) E PMfU^ji >-0 (2) 

where fi,f[ £ {!,..., K}, i £ {2, . . . , n}. However, the inner sum in (T5]) is independent of 
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Figure 2: Sample dataset over 5 SNPs (a) and corresponding trie (b). 



fi, and so its repeated computation can be avoided by replacing (J2J) with: 

c %-un = E ^iLn rMUU)^ >-0 0) 
/' 1=1 

" i — i 

= E PMim^c^j, (4) 
/i~i=i 

By using recurrences dlj, (j3J), and (jl]), all forward probabilities can be computed in 0(nK 3 ) 
time. Similarly, backward probabilities can be computed in 0(nK 3 ) time by using: 

(5) 

tj,PM(fu\f:)£t'j>M + i) (6) 



(7) 



fnifn 


= 1 
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^ fi + lj- 


= E 
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B }i,fl 


= E 


/i+i=i 



3.1 Trie Speed-up 

For a dataset consisting of m samples (i.e., multi-locus genotypes), running the forward- 
backward algorithm independently on each sample results in a runtime of 0(mnK 3 ), where 
n is the number of SNP loci, and K is the number of founder haplotypes. However, inde- 
pendent processing of the samples leads to repeated computation of forward and backward 
probabilities corresponding to genotype prefixes (respectively suffixes) shared by multiple 
genotypes. To avoid this, GEDI builds a prefix tree, or trie, from the given multilocus geno- 
types (see Fig. [2] for an example) and then computes probabilities by performing a preorder 
traversal of the trie. Computation of backward probabilities is sped-up in a similar way 
using a trie of reversed genotypes. 
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The speed-up achieved by using tries depends on the number and the similarity of the 
samples, as well as the number of SNP loci. For example, when performing imputation using 
10 flanking SNPs on the 2502 samples of the IMAGE dataset described in next section, using 
tries gives a speed-up of 3 x . 



4 Experimental Results 



For empirical evaluations of GEDI's error detection and phasing algorithms, and for a com- 
parison of imputation algorithms implemented by GEDI and several other publicly available 
software packages including [2], EH HH [T2J, [131 EH] we direct the reader to [9], [5], respec- 
tively [1]. Here we present experimental results exploring the effect of GEDI's user-selected 
parameters on imputation accuracy. 

Imputation experiments were performed on the Perlegen 600k genotype data (dbGaP 
accession number phs000016.vl.pl) generated by the International Multisite ADHD Ge- 
netics (IMAGE) project, comprising 958 parents-child trios from seven European countries 
and Israel. After excluding trios with one or more samples removed by data cleaning steps 
described in [13], we randomly selected 100 trios and phased them using the entropy min- 
imization algorithm and pooled parental haplotypes with the 120 CEU haplotypes from 
Hapmap release 22 to form a reference panel of 520 haplotypes. The test data consisted of 
the genotypes of remaining 2502 IMAGE individuals, treated as unrelated unless otherwise 
indicated. Specifically, we masked 9% of the typed SNP loci on chromosome 22 (530 out of 
5835), and computed the imputation error rate as the percentage of discordant imputations 
out of the total of 1,326,060 masked SNP genotypes. In all imputation experiments we used 
10 typed SNP loci on each side of masked loci, which, as shown in Fig. [3[ yields an excellent 
tradeoff between accuracy and runtime. 




■ 7 Founders- Error Rate 
13 Founders- Error Rate 

■ 7 Founders- Time 
13 Founders- Time 



# Flanking SNPs 



Figure 3: Imputation error rate and runtime for varying number of flanking typed SNP loci 
(IMAGE chr. 22 dataset, 520 training haplotypes). 
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Figure 4: Imputation error rate on the IMAGE chr. 22 dataset for varying numbers of HMM 
founders and training haplotypes. 

Table 1: Imputation error rate on the IMAGE chr. 22 dataset for varying numbers of HMM 
founders and training haplotypes. 



# Training 
Haplotypes 


# Founders 


3 


5 


7 


9 


11 


13 


15 


30 


17.02% 


13.65% 


13.11% 


12.82% 


12.27% 


12.37% 


12.47% 


60 


15.21% 


11.10% 


10.00% 


9.75% 


9.62% 


9.59% 


9.55% 


90 


14.82% 


10.35% 


9.58% 


9.04% 


8.63% 


8.71% 


8.57% 


120 


14.39% 


10.11% 


8.93% 


8.52% 


8.30% 


8.23% 


8.13% 


220 


13.73% 


9.42% 


8.28% 


7.58% 


7.26% 


7.27% 


7.16% 


320 


14.31% 


9.53% 


7.91% 


7.37% 


6.94% 


6.81% 


6.78% 


420 


14.10% 


8.82% 


7.70% 


7.09% 


6.75% 


6.56% 


6.51% 


520 


13.54% 


9.38% 


7.48% 


6.86% 


6.61% 


6.47% 


6.33% 



Fig. [Hand Tabled] give GEDI imputation accuracy when the number of HMM founders is 
varied between 3 and 15 and the number of training haplotypes is varied between 30 and 520. 
Accuracy improves significantly when using reference panels larger than the commonly used 
Hapmap panels, particularly in conjunction with increasing the number of HMM founders. 
For example, compared to the GEDI settings used in [I] (120 training haplotypes and 7 
founders), increasing the number of training haplotypes to 520 and the number of founders 
to 15 yields an accuracy gain of over 2.5%. 

Although the accuracy gained by using a larger number of HMM founders comes at the 
cost of increased imputation time, the latter remains practical for up to 15 founders, above 
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which accuracy gains become very small. Indeed, as shown in Fig. [5j GEDI optimizations 
such as the trie speed-up described in Section 13.11 lead to sub-cubic runtime growth within 
the tested range of HMM founders, allowing users to better control the tradeoff between 
imputation speed and accuracy. 

GEDI is also able to exploit pedigree information when available. For genotype data of 
related individuals, imputation probabilities (and log-likelihood ratios) are computed jointly 
over parents-child trios, using an extended version of the forward-backward algorithm in 




Figure 6: Effect of using pedigree information during imputation (IMAGE chr. 22 dataset, 
13 HMM founders). 
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Table 2: Comparison of two GEDI imputation flows on a version of the IMAGE chr. 22 
dataset generated by randomly inserting 1% errors and 1% missing data (520 training hap- 
lo types) . 



GEDI flow 


7 Founders 


13 Founders 


Error Rate CPU sec. 


Error Rate CPU sec. 


IMP 

EDC+MDR+IMP 


8.17% 410 
8.07% 4,153 


7.20% 1,637 
6.91% 15,937 



Section [3] (see [9] for details). Fig. [6]) compares the imputation error achieved by running 
GEDI with 13 HMM founders on the IMAGE dataset under two scenarios: (a) treating the 
2502 test individuals as unrelated (as we have done in all previous experiments), and (b) 
analyzing them as 834 parents-child trios. Performing trio-based imputation reduces error 
rate by 0.22-0.44%, depending on the number of haplotypes used for training the model, 
pointing out to the value of using pedigree information. 

Finally, we conducted experiments to assess the value of performing genotype error cor- 
rection and missing data recovery prior to imputation. We generated a version of the IMAGE 
chr. 22 dataset generated by randomly inserting 1% errors and 1% missing data at typed 
SNP loci, and then ran two different analysis flows provided by GEDI: 

• In the first flow, referred to as IMP, genotypes at untyped SNP (the same as those used 
in previous experiments) were imputed based on the genotype data at typed SNPs and 
HMM models trained using 520 reference haplotypes. 

• In the second flow, referred to as EDC+MDR+IMP, we first trained an HMM model 
over typed SNPs using the 520 reference haplotypes together with haplotypes inferred 
by phasing all test genotypes. This model was next used to run GEDI's error detection 
and correction and missing data recovery functions, replacing every SNP genotype gi for 
which the likelihood ratio max^ Pm(s[9% ^~ x ])/Pm{&) is greater than 10 3 , respectively 
every missing SNP genotype g i} by argmax x P M (g[<?i <— %])■ Finally, imputation was 
performed as in the IMP flow, but based on the modified genotype data for typed 
SNPs rather than the original genotypes. 

Table [2] gives the error rate and runtime for running the two flows with 7, respectively 13 
HMM founders. Performing the EDC+MDR+IMP flow improves accuracy over direct im- 
putation in both cases, by almost 0.3% in the case of 13 founders. While EDC+MDR+IMP 
requires about 10 x more time for the IMAGE dataset used in our experiment, the runtime 
increase should be much smaller for typical GWAS datasets, for which the number of typed 
loci is typically smaller than that of untyped loci. Indeed, for such datasets imputation 
time (which grows linearly with the number of untyped loci) is likely to dominate the time 
needed for performing error detection and correction and missing data recovery (which is 
proportional to the number of typed loci). 

While the accuracy gains obtained by using pedigree information or performing the 
EDC+MDR+IMP flow are small, they can translate in non-negligible cost savings. In- 
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deed, as noted by [7], each 1% gain in imputation accuracy translates into a 5-10% reduction 
in the sample size needed to achieve a desired statistical power level. 
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